import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import pandas as pd
%matplotlib inline
plt.rcParams['figure.dpi'] = 70
from sklearn.linear_model import LinearRegression, LogisticRegression, BayesianRidge
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score
We introduce 3 training datasets that will be used in this note.
np.random.seed(12321)
x0 = np.random.normal(size=50).reshape(-1, 2) - 1
x1 = np.random.normal(size=50).reshape(-1, 2) + 1
x2 = np.random.normal(size=50).reshape(-1, 2) + 3.
x_1 = np.random.normal(size=10).reshape(-1, 2) + np.array([11., 10.])
train_data ={'2-class': (np.concatenate([x0, x1]), np.concatenate([np.zeros(25), np.ones(25)]).astype(int)),
'2-class_ex':
(np.concatenate([x0, x1, x_1]), np.concatenate([np.zeros(25), np.ones(30)]).astype(int)),
'3-class':
(np.concatenate([x0, x1, x2]), np.concatenate([np.zeros(25), np.ones(25), 2 + np.zeros(25)]).astype(int))}
plt.figure(figsize=(15, 4))
for i, (key, value) in enumerate(train_data.items()):
plt.subplot(1, 3, i+1)
x_train, t_train = value
plt.scatter(x_train[:,0], x_train[:,1], c=t_train, cmap=matplotlib.colors.ListedColormap(['red','green','blue']))
plt.title("{}".format(key))
plt.show()
The goal in classification is to assign a $p$-dimensional input $\mathbf{x}$ to one of $K$ classes $\mathcal{C}_k$. In general, there are two approaches to perform the task:
The nonprobabilistic approach constructs a discriminant function that directly assigns each vector $\mathbf{x}$ to a specific class.
The probabilistic approach models the conditional probability distribution $p(\mathcal{C}_k|\mathbf{x})$ in an inference stage, and then uses this distribution to make optimal decisions.
We define the target vector as
\begin{align*} \mathbf{t} = (t_k: k = 1, \ldots, K)^\intercal \end{align*}where the value of $t_k$ is interpreted as the probability that the class is $\mathcal{C}_k$. For example, $\mathbf{t} = (0, 1, 0, 0, 0)^\intercal$ indicates a pattern from class 2 out of 5 classes.
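As a small illustration of this 1-of-$K$ coding (the array names below are hypothetical and not used elsewhere in this note), the target vectors can be built with NumPy:
# Illustrative 1-of-K (one-hot) target coding: np.eye(K)[labels] picks the target vector for each label
labels_demo = np.array([1, 0, 4, 1])       # class indices of four hypothetical patterns, K = 5
T_demo = np.eye(5)[labels_demo]            # each row is a target vector t
print(T_demo[0])                           # [0. 1. 0. 0. 0.] indicates class 2 out of 5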
Discriminant Functions¶
Consider a $K$-class discriminant comprising $K$ linear functions of the form:
\begin{align*} y_k(\mathbf{x}) = \mathbf{w}_k^\intercal \mathbf{x} + w_{k0} \end{align*}$\mathbf{x}$ is assigned to class $\mathcal{C}_{k^\ast}$ if $y_{k^\ast}(\mathbf{x}) > y_j(\mathbf{x})$ for all $j \ne k^\ast$, i.e. class
\begin{align*} k^* = \text{argmax}_k \{y_k(\mathbf{x}): k = 1,\ldots,K\} \end{align*}¶
$K=2$¶
It is equivalent to consider a linear discriminant function $y(\mathbf{x}) = \mathbf{w}^\intercal \mathbf{x} + w_0$ which assigns $\mathbf{x}$ to
class $\mathcal{C}_1$ if $y(\mathbf{x}) \ge 0$, and
class $\mathcal{C}_2$ otherwise.
where $\mathbf{w}$ is called a weight vector, and $w_0$ is a bias. The negative of the bias is sometimes called a threshold.
$K>2$¶
Although it is tempting to build a $K$-class discriminant by combining two-class functions as follows:
One-versus-the-rest: $K-1$ classifiers each of which solves a two-class problem of separating points in a particular class $\mathcal{C}_k$ from points not in that class,
One-versus-one: $K(K-1)/2$ binary discriminant functions, one for every possible pair of classes,
both approaches lead to regions of input space that are ambiguously classified.
Least Squares¶
Consider a training data set $\{\mathbf{x}_i, \mathbf{t}_i\}$ where $i = 1,\ldots,n$. The least squares approach is to find $\mathbf{w}_k$, $k = 1, \ldots, K$, such that the sum-of-squares error between $\mathbf{y}(\mathbf{x}_i) = (y_1(\mathbf{x}_i), \ldots, y_K(\mathbf{x}_i))$ and $\mathbf{t}_i$ is minimized.
We can conveniently group those linear models together using vector notation so that \begin{align} \mathbf{y}(\mathbf{x}) = \tilde{\mathbf{W}}^T \tilde{\mathbf{x}} \nonumber \end{align}
where
\begin{align} \tilde{\mathbf{W}} = \left( \begin{array}{lll} w_{10} & \cdots & w_{K0} \\ \mathbf{w}_1 & \cdots & \mathbf{w}_K \end{array} \right) = \left( \begin{array}{lll} w_{10} & \cdots & w_{K0} \\ w_{11} & \cdots & w_{K1} \\ \vdots & \ddots & \vdots \\ w_{1p} & \cdots & w_{Kp} \\ \end{array} \right)_{(p+1) \times K},~ \tilde{\mathbf{x}} = \left( \begin{array}{l} 1 \\ x_{1} \\ \vdots \\ x_{p} \\ \end{array} \right) \nonumber \end{align}For the training data set $\{\mathbf{x}_i, \mathbf{t}_i\}$, define
\begin{align} \mathbf{T} = \left( \begin{array}{l} \mathbf{t}_1 \\ \mathbf{t}_2 \\ \vdots \\ \mathbf{t}_n \\ \end{array} \right)_{n \times K} \text{ and } \tilde{\mathbf{X}} = \left( \begin{array}{l} \tilde{\mathbf{x}}^\intercal_1 \\ \tilde{\mathbf{x}}^\intercal_2 \\ \vdots \\ \tilde{\mathbf{x}}^\intercal_n \\ \end{array} \right)_{n \times (p+1)} \nonumber \end{align}where $\mathbf{t}_i = (0, \ldots, 0, \underbrace{1}_{k}, 0, \ldots, 0 )_{K \times 1}$ indicates that $\mathbf{x}_i$ belongs to class $k$. Minimizing the sum-of-squares error function
\begin{align} \mathbb{SSE} \left[ \tilde{\mathbf{W}} \right] = \frac{1}{2} \text{Tr} \left\{ \left( \tilde{\mathbf{X}}\tilde{\mathbf{W}} - \mathbf{T}\right)^\intercal \left( \tilde{\mathbf{X}}\tilde{\mathbf{W}} - \mathbf{T}\right) \right\} \text{ gives } \tilde{\mathbf{W}} = \left( \tilde{\mathbf{X}}^\intercal \tilde{\mathbf{X}} \right)^{-1} \tilde{\mathbf{X}}^\intercal \mathbf{T} \nonumber \end{align}As shown in the previous chapter, least squares is equivalent to maximum likelihood under a Gaussian conditional distribution $p(y|\mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(y | f(\mathbf{x}, \mathbf{w}), \beta^{-1})$.
The plots below demonstrate the least squares method for classification on the 3 datasets given at the beginning of this chapter.
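The LeastSquares classifier used in the code below is not defined in this note (it presumably comes from an accompanying helper module). A minimal sketch, consistent with the normal-equation solution above and sufficient to run the cells in this note, could be:
class LeastSquares:
    """Least-squares classifier: W = pinv(X_tilde) @ T with one-hot targets T."""
    def fit(self, X, t):
        X_tilde = np.hstack([np.ones((len(X), 1)), X])    # prepend the bias feature
        T = np.eye(int(np.max(t)) + 1)[t]                 # one-hot target matrix
        self.W = np.linalg.pinv(X_tilde) @ T              # (X^T X)^{-1} X^T T via the pseudo-inverse
        return self
    def predict(self, X):
        X_tilde = np.hstack([np.ones((len(X), 1)), X])
        return np.argmax(X_tilde @ self.W, axis=1)        # assign to the class with the largest y_k(x)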
x1_test, x2_test = np.meshgrid(np.linspace(-5, 12, 100), np.linspace(-5, 12, 100))
x_test = np.array([x1_test, x2_test]).reshape(2, -1).T
plt.figure(figsize=(15, 5))
for i, (key, value) in enumerate(train_data.items()):
plt.subplot(1, 3, i+1)
x_train, t_train = value
# t = argmax of least square, e.g., (0,1,2) for 3-class data
t = LeastSquares().fit(x_train, t_train).predict(x_test)
plt.scatter(x_train[:,0], x_train[:,1], c=t_train, cmap=matplotlib.colors.ListedColormap(['red','green','blue']))
plt.contourf(x1_test, x2_test, t.reshape(100, 100), alpha=0.3, levels=np.array([0., 0.5, 1.5, 2.]),
cmap=matplotlib.colors.ListedColormap(['yellow','green','purple']))
plt.xlim(-5, 12)
plt.ylim(-5, 12)
plt.gca().set_aspect('equal', adjustable='box')
plt.title("{}".format(key))
plt.show()
Although the least-squares approach classifies the 2-class data fairly well, it lacks robustness and is highly sensitive to outliers, as shown for the 2-class_ex data. Basically, the sum-of-squares error function penalizes predictions that are "too correct" in that they lie a long way on the correct side of the decision boundary.
It also gives poor classification for the 3-class data, assigning only a small region to the middle cluster. The problem becomes serious for the least-squares approach when $K \ge 3$, where classes can be masked by others. Consider an extreme situation with a simulated dataset as follows.
We have 3 classes of data $(x^k_1,x^k_2)$, $k=1,2,3$ in a plane where \begin{align} x^k_1, x^k_2 \in \left\{ \begin{aligned} &[-6, -2] && k=1 \text{ for 1st class}\\ &[-2, 2] && k=2 \text{ for 2nd class}\\ &[2,6] && k=3 \text{ for 3rd class} \end{aligned} \right. \label{data_3groups} \end{align}
The middle class is mostly masked (dominated) by the other two classes: many points of the middle class are assigned the colors of the other two classes. This happens because, for most $\mathbf{x} \in \mathcal{C}_k$ (the middle class), there exists some $j \ne k$ such that $y_{k}(\mathbf{x}) < y_j(\mathbf{x})$. Thus, class $\mathcal{C}_k$ is masked by the others due to the linear nature of the discriminant functions.
size = 100
np.random.seed(123)
x0 = np.random.multivariate_normal([-4, -4], np.eye(2), size)
x1 = np.random.multivariate_normal([ 0, 0], np.eye(2), size)
x2 = np.random.multivariate_normal([ 4, 4], np.eye(2), size)
x = np.concatenate((x0, x1, x2))
x_tilde = np.hstack((np.ones((size*3, 1)), x))
t1, t2, t3 = np.eye(3)
t = np.vstack((np.tile(t1, (size, 1)),
np.tile(t2, (size, 1)),
np.tile(t3, (size, 1))))
reg = LinearRegression().fit(x, t)
y = reg.predict(x)
t_pred = y.argmax(axis=1)
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
polyreg = Pipeline([('poly', PolynomialFeatures(degree=2)),
('linear', LinearRegression())])
poly_y = polyreg.fit(x, t).predict(x)
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(x0[:,0], x0[:,1], 'o', color='C0', markersize=3)
plt.plot(x1[:,0], x1[:,1], 'o', color='C1', markersize=3)
plt.plot(x2[:,0], x2[:,1], 'o', color='C2', markersize=3)
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')
plt.title('Simulated Data')
plt.subplot(1, 2, 2)
plt.plot(x[t_pred == 0][:,0], x[t_pred == 0][:,1], 'o', color='C0', markersize=3)
plt.plot(x[t_pred == 1][:,0], x[t_pred == 1][:,1], 'o', color='C1', markersize=3)
plt.plot(x[t_pred == 2][:,0], x[t_pred == 2][:,1], 'o', color='C2', markersize=3)
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')
plt.title('Classification using Linear Regression')
plt.show()
We have three discriminant functions $y_k(\mathbf{x})$ with $k=1,2,3$. For each simulated point $(x_1, x_2)$, let $\hat{y} = (\hat{y}_0, \hat{y}_1, \hat{y}_2)$ be the prediction from the linear (or degree-2 polynomial) regression, whose three components correspond to the three classes.
Each figure below has two components:
The main part shows the $\hat{y}$ values, where each color corresponds to one discriminant function. The code plot(x, y[:, 0], 'o', color='C0', markersize=2) plots the two points $(x^k_1, \hat{y}_0)$ and $(x^k_2, \hat{y}_0)$ for every data point of every class $k=1,2,3$.
The line at the bottom shows the $x$ values. The code plot(x0, [y_floor]*size, 'o', color='C0', markersize=2) plots the two points $(x^1_1, y_{\text{floor}})$ and $(x^1_2, y_{\text{floor}})$ for every point of the first class (x0 in the code), showing how the data points are grouped as in \eqref{data_3groups}.
The classified class is defined as
\begin{align*} k^* = \text{argmax}_k \{y_k(\mathbf{x}): k = 1,\ldots,K\} \end{align*}or, in the code, t_pred = reg.predict(x).argmax(axis=1).
Linear Regression¶
Because of this argmax operator, it is clear that the first group is largely classified as blue and the third group as green. However, the middle class cannot be identified by the linear regression model, because its discriminant function is almost nowhere the largest of the three.
Polynomial(deg=2) Regression¶
Because of this argmax operator, the middle class can be identified by the polynomial regression model. However, it can be seen that if there were 4 rather than 3 classes lined up as in the simulated data, a quadratic would not be enough.
fig = plt.figure(1, figsize=(12, 5))
ax1 = fig.add_subplot(1, 2, 1)
# plot([[x1,x2]], y) plots two points (x1,y) and (x2,y)
# any point in Simulated Data figure gives two points here
ax1.plot(x, y[:, 0], 'o', color='C0', markersize=2)
ax1.plot(x, y[:, 1], 'o', color='C1', markersize=2)
ax1.plot(x, y[:, 2], 'o', color='C2', markersize=2)
y_floor, _ = ax1.get_ylim()
ax1.plot(x0, [y_floor]*size, 'o', color='C0', markersize=2)
ax1.plot(x1, [y_floor]*size, 'o', color='C1', markersize=2)
ax1.plot(x2, [y_floor]*size, 'o', color='C2', markersize=2)
ax1.set_title('Linear Regression')
ax2 = fig.add_subplot(1, 2, 2)
ax2.plot(x, poly_y[:, 0], 'o', color='C0', markersize=2)
ax2.plot(x, poly_y[:, 1], 'o', color='C1', markersize=2)
ax2.plot(x, poly_y[:, 2], 'o', color='C2', markersize=2)
y_floor, _ = ax2.get_ylim()
ax2.plot(x0, [y_floor]*size, 'o', color='C0', markersize=2)
ax2.plot(x1, [y_floor]*size, 'o', color='C1', markersize=2)
ax2.plot(x2, [y_floor]*size, 'o', color='C2', markersize=2)
ax2.set_title('Polynomial(deg=2) Regression')
plt.show()
The failure of least squares should not surprise us when we recall that it corresponds to maximum likelihood under the assumption of a Gaussian conditional distribution, whereas binary target vectors clearly have a distribution that is far from Gaussian.
Linear Discriminant Analysis¶
Linear Discriminant Analysis (LDA) is most commonly used as a dimensionality reduction technique in the pre-processing step for pattern-classification and machine learning applications. The goal is to project a dataset onto a lower-dimensional space with good class separability, in order to avoid overfitting ("curse of dimensionality") and also to reduce computational costs.
For optimal classification, we need to know the class posteriors $p(\mathcal{C}_k | \mathbf{x})$ of our observation $\mathbf{x}$:
\begin{align*} p(\mathcal{C}_k | \mathbf{x}) = \frac{p(\mathbf{x}|\mathcal{C}_k)p(\mathcal{C}_k)}{p(\mathbf{x})} = \frac{p(\mathbf{x}|\mathcal{C}_k)p(\mathcal{C}_k)}{\sum_{j}p(\mathbf{x}|\mathcal{C}_j)p(\mathcal{C}_j)} \end{align*}Assume class-conditional densities are Gaussian with common covariance matrix $\Sigma$.
\begin{align} p(\mathbf{x}|\mathcal{C}_k) = \frac{1}{(2\pi)^{D/2}} \frac{1}{\lvert \Sigma \rvert ^{1/2}} \exp \left( -\frac{1}{2} (\mathbf{x} - \mathbf{\mu}_k)^\intercal \Sigma^{-1} (\mathbf{x} - \mathbf{\mu}_k) \right) \nonumber \end{align}If $p(\mathbf{x}|\mathcal{C}_k)$ has its own covariance matrix $\Sigma_k$, then we will obtain $y_k(\mathbf{x})$ as a quadratic discriminant function of $\mathbf{x}$, giving rise to a quadratic discriminant analysis.
When comparing two classes $k$, $\ell$ by $p(\mathcal{C}_k | \mathbf{x}) - p(\mathcal{C}_\ell | \mathbf{x})$, it is sufficient to consider
\begin{align} \ln p(\mathcal{C}_k| \mathbf{x}) - \ln p(\mathcal{C}_\ell | \mathbf{x}) &= \ln \frac{p(\mathcal{C}_k | \mathbf{x})}{p(\mathcal{C}_\ell | \mathbf{x})} = \ln \frac{p(\mathbf{x}|\mathcal{C}_k)p(\mathcal{C}_k)}{p(\mathbf{x}|\mathcal{C}_\ell)p(\mathcal{C}_\ell)} \nonumber \\ & = \mathbf{x}^\intercal \Sigma^{-1} (\mu_k - \mu_\ell) - \frac{1}{2} (\mu_k+\mu_\ell)^\intercal \Sigma^{-1} (\mu_k - \mu_\ell) + \ln \frac{p(\mathcal{C}_k)}{p(\mathcal{C}_\ell)} \tag{LDA} \\ & \equiv \ln \frac{y_k(\mathbf{x})}{y_\ell(\mathbf{x})} \nonumber \end{align}The discriminant function \begin{align*} y_k(\mathbf{x}) = \ln p(\mathbf{x}|\mathcal{C}_k)p(\mathcal{C}_k) &= \mathbf{w}^\intercal_k \mathbf{x} + w_{k0} \text{ where}\\ \mathbf{w}_k &= \Sigma^{-1} \mu_k \\ w_{k0} &= - \frac{1}{2} \mu^\intercal_k \Sigma^{-1} \mu_k + \ln p(\mathcal{C}_k) \end{align*}
The parameters of the Gaussian distributions are estimated from the training data (a short numerical sketch follows this list):
$p(\mathcal{C}_k) = n_k / n$ where $n_k$ is the number of class-$k$ observations
$\mu_k = \sum_{j \in \mathcal{C}_k} \mathbf{x}_j / n_k$
$\Sigma = \frac{1}{n} \sum_{k=1}^{K} \sum_{j \in \mathcal{C}_k} (\mathbf{x}_j - \mu_k) (\mathbf{x}_j - \mu_k)^\intercal$
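A minimal numerical sketch of these estimates on the 2-class data from the beginning of this note (illustration only; the variable names below are not used elsewhere):
X_tr, t_tr = train_data['2-class']
priors = np.array([np.mean(t_tr == k) for k in range(2)])               # p(C_k) = n_k / n
means = np.array([X_tr[t_tr == k].mean(axis=0) for k in range(2)])      # mu_k
Sigma = sum((X_tr[t_tr == k] - means[k]).T @ (X_tr[t_tr == k] - means[k])
            for k in range(2)) / len(X_tr)                              # shared covariance matrix
Sigma_inv = np.linalg.inv(Sigma)
W_lda = Sigma_inv @ means.T                                             # columns are w_k = Sigma^{-1} mu_k
w0 = -0.5 * np.sum(means @ Sigma_inv * means, axis=1) + np.log(priors)  # w_{k0}
pred = np.argmax(X_tr @ W_lda + w0, axis=1)                             # y_k(x) = w_k^T x + w_{k0}
print('training accuracy:', np.mean(pred == t_tr))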
It is surprising that LDA and QDA have a good track record in many applications. The reason is not likely to be that the data are approximately Gaussian, and in addition for LDA that the covariances are approximately equal. More likely a reason is that the data can only support simple decision boundaries such as linear or quadratic.
Fisher's Linear Discriminant¶
Fisher derived LDA via a route without referring to Gaussian distributions.
Fisher's idea is to maximize a function that
gives a large separation between the projected class means,
gives a small variance within each class,
thereby minimizing the class overlap.
Consider a two-class problem with $n_1$ points of class $\mathcal{C}_1$ and $n_2$ points of class $\mathcal{C}_2$
We reduce the $p$-dimensional $\mathbf{x}$ to one dimension using $y = \mathbf{w}^\intercal \mathbf{x}$.
We choose $\mathbf{w}$ to maximize $\mathbf{w}^\intercal(\mathbf{m}_1-\mathbf{m}_2) \equiv m_1-m_2$, subject to the constraint $\sum_{i} w^2_i=1$,
where the mean vectors of the two classes are
\begin{align} \mathbf{m}_1 = \frac{1}{n_1} \sum_{i \in \mathcal{C}_1} \mathbf{x}_i \text{ and } \mathbf{m}_2 = \frac{1}{n_2} \sum_{i \in \mathcal{C}_2} \mathbf{x}_i \nonumber \end{align}¶
Fisher's method¶
We can get \begin{align} \mathbf{w} \propto S^{-1}_W(\mathbf{m}_2-\mathbf{m}_1) \nonumber \end{align}
where $S_W$ is the total within-class covariance matrix. Note that $S^{-1}_W$ is proportional to $\Sigma^{-1}$ in (LDA). Hence, Fisher's method gives the same discriminant direction as in (LDA).
Technical Details¶
Fisher's method is to maximize
\begin{align} J(\mathbf{w}) &= \frac{(m_1 - m_2)^2}{\sum_{k=1}^{2}\sum_{i \in \mathcal{C}_k} (y_i-m_k)^2} = \frac{\text{between class variance}}{\text{within class variance}} \nonumber \\ &= \frac{\mathbf{w}^\intercal (\mathbf{m}_2-\mathbf{m}_1) \bigg(\mathbf{w}^\intercal(\mathbf{m}_2-\mathbf{m}_1)\bigg)^\intercal}{\sum_{k=1}^{2} \sum_{i \in \mathcal{C}_k} \mathbf{w}^\intercal(\mathbf{x}_i - \mathbf{m}_k) \bigg(\mathbf{w}^\intercal(\mathbf{x}_i - \mathbf{m}_k)\bigg)^\intercal} \nonumber \\ &= \frac{\mathbf{w}^\intercal (\mathbf{m}_2-\mathbf{m}_1)(\mathbf{m}_2-\mathbf{m}_1)^\intercal \mathbf{w}}{\mathbf{w}^\intercal \sum_{k=1}^{2} \sum_{i \in \mathcal{C}_k} (\mathbf{x}_i - \mathbf{m}_k)(\mathbf{x}_i - \mathbf{m}_k)^\intercal \mathbf{w}} \nonumber \\ &\equiv \frac{\mathbf{w}^\intercal \mathbf{S}_B \mathbf{w}}{\mathbf{w}^\intercal \mathbf{S}_W \mathbf{w}} \nonumber \end{align}\begin{align} J^\prime(\mathbf{w}) &= 0 \Rightarrow \left(\mathbf{w}^\intercal \mathbf{S}_W \mathbf{w}\right) \mathbf{S}_B \mathbf{w} = \left(\mathbf{w}^\intercal \mathbf{S}_B \mathbf{w}\right) \mathbf{S}_W \mathbf{w} \Rightarrow \mathbf{w} \propto S^{-1}_W(\mathbf{m}_2-\mathbf{m}_1) \nonumber \end{align}For general $K$, Fisher's method is to maximize the so-called Rayleigh quotient
\begin{align} \max_{\mathbf{w}} J(\mathbf{w}) = \frac{\mathbf{w}^\intercal \mathbf{S}_B \mathbf{w}}{\mathbf{w}^\intercal \mathbf{S}_W \mathbf{w}} = \frac{\text{between class variance}}{\text{within class variance}} \nonumber \end{align}where \begin{align*} S_B &= \sum_{k=1}^{K} n_{k} (\mathbf{m}_k - \mathbf{m}) (\mathbf{m}_k - \mathbf{m})^\intercal \\ S_W &= \sum_{k=1}^{K} S_k \text{ with } S_k = \sum_{i \in \mathcal{C}_k} (\mathbf{x}_i - \mathbf{m}_k) (\mathbf{x}_i - \mathbf{m}_k)^\intercal \end{align*}
\begin{align*} \mathbf{m}_k &= \frac{1}{n_k} \sum_{i \in \mathcal{C}_k} \mathbf{x}_i \\ \mathbf{m} &= \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i = \frac{1}{n} \sum_{k=1}^{K} n_k \mathbf{m}_k \end{align*}and $n_k$ is the sample size of the class $k$.
Because $J(\mathbf{w})$ is invariant under rescaling $\mathbf{w} \leftarrow \alpha \mathbf{w}$, we can choose $\alpha$ so that the denominator satisfies $\mathbf{w}^\intercal S_W \mathbf{w} = 1$ (since it is a scalar). Thus
\begin{align*} \max_{\mathbf{w}}~ J(\mathbf{w}) = \min_{\mathbf{w}} - \mathbf{w}^\intercal S_B \mathbf{w} ~\text{ s.t. } \mathbf{w}^\intercal S_W \mathbf{w} = 1 \end{align*}Let $\mathbf{w}^\ast = S_W^{1/2} \mathbf{w}$ and $S^*_B = (S_W^{-1/2})^\intercal S_B S_W^{-1/2}$, we have
\begin{align*} \min_{\mathbf{w}^*}~ -(\mathbf{w}^*)^\intercal S^*_B \mathbf{w}^* ~\text{ s.t. } (\mathbf{w}^*)^\intercal \mathbf{w}^* = 1 \end{align*}Letting $M$ be the $K \times p$ matrix of class centroids, we have
\begin{align*} S_B &= M^\intercal M \\ S^*_B &= (M S_W^{-1/2})^\intercal (M S_W^{-1/2}) \end{align*}The Lagrangian is
\begin{align*} L = -(\mathbf{w}^{*})^{\intercal}S_B^{*}\mathbf{w}^{*} + \lambda[(\mathbf{w}^{*})^{\intercal}\mathbf{w}^{*}-1] \end{align*}Its derivative gives
\begin{align*} S_B^{*}\mathbf{w}^{*}=\lambda \mathbf{w}^{*} \end{align*}The optimization becomes an eigendecomposition problem.
The relationship between LDA and linear regression
When $K=2$, there is a simple correspondence between LDA and linear regression:
we adopt a different target coding scheme in which
$t_i =n/n_1$ if the input $\mathbf{x}_i$ is in class $\mathcal{C}_1$ (the value approximates the reciprocal of the prior probability for class $\mathcal{C}_1$) and;
$t_i = -n/n_2$ if the input $\mathbf{x}_i$ is in class $\mathcal{C}_2$,
where $n$ is the total number of observations and $n_k$ ($k=1, 2$) is the number of observations in class $k$.
Minimizing the sum of squares error function
\begin{align} \frac{1}{2} \sum_{i=1}^{n} \left(\mathbf{w}^\intercal \mathbf{x}_i + w_0 - t_i \right)^2 \text{ gives } \mathbf{w} \propto S^{-1}_W(\mathbf{m}_2-\mathbf{m}_1) \nonumber \end{align}The code demonstrates the implementation of Fisher's method for 2-class data:
x_train, t_train = train_data['2-class']
x1_test, x2_test = np.meshgrid(np.linspace(-5, 5, 100), np.linspace(-5, 5, 100))
x_test = np.array([x1_test, x2_test]).reshape(2, -1).T
X0 = x_train[t_train == 0]
X1 = x_train[t_train == 1]
m0 = np.mean(X0, axis=0)
m1 = np.mean(X1, axis=0)
cov_inclass = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
w = np.linalg.solve(cov_inclass, m1 - m0)
w /= np.linalg.norm(w).clip(min=1e-10)
mu0 = np.mean(X0 @ w, axis=0)
mu1 = np.mean(X1 @ w, axis=0)
var0 = np.var(X0 @ w, axis=0)
var1 = np.var(X1 @ w, axis=0)
a = var1 - var0
b = var0 * mu1 - var1 * mu0
c = var1 * mu0 ** 2 - var0 * mu1 ** 2 - var1 * var0 * np.log(var1 / var0)
threshold = (np.sqrt(b ** 2 - a * c) - b) / a
t = (x_test @ w > threshold).astype(int)
plt.scatter(x_train[:, 0], x_train[:, 1], c=t_train)
plt.contourf(x1_test, x2_test, t.reshape(100, 100), alpha=0.2, levels=np.linspace(0, 1, 3),
cmap=matplotlib.colors.ListedColormap(['yellow','green','purple']))
plt.xlim(-5, 5)
plt.ylim(-5, 5)
plt.gca().set_aspect('equal', adjustable='box')
plt.show()
LDA Example: Iris Data¶
Consider the famous "Iris" dataset which contains measurements for 150 iris flowers from three different species.
The three classes ($K=3$) in the Iris dataset:
Iris-setosa (n=50)
Iris-versicolor (n=50)
Iris-virginica (n=50)
The four features ($p=4$) of the Iris dataset:
sepal length in cm
sepal width in cm
petal length in cm
petal width in cm
We define
\begin{align} \pmb X = \begin{bmatrix} x_{1_{\text{sepal length}}} & x_{1_{\text{sepal width}}} & x_{1_{\text{petal length}}} & x_{1_{\text{petal width}}}\\ x_{2_{\text{sepal length}}} & x_{2_{\text{sepal width}}} & x_{2_{\text{petal length}}} & x_{2_{\text{petal width}}}\\ ... \\ x_{150_{\text{sepal length}}} & x_{150_{\text{sepal width}}} & x_{150_{\text{petal length}}} & x_{150_{\text{petal width}}}\\ \end{bmatrix}, ~~ \pmb y = \begin{bmatrix} \omega_{\text{setosa}}\\ \omega_{\text{setosa}}\\ ... \\ \omega_{\text{virginica}}\end{bmatrix} \nonumber \end{align}We use the LabelEncoder from the scikit-learn library to convert the class labels into numbers: 1, 2, and 3.
feature_dict = {i:label for i,label in zip(
range(4),
('sepal length in cm',
'sepal width in cm',
'petal length in cm',
'petal width in cm', ))}
df = pd.read_csv(
'https://raw.githubusercontent.com/ming-zhao/Optimization-and-Learning/master/data/iris.data',
header=None, sep=',')
df.columns = [l for i,l in sorted(feature_dict.items())] + ['class label']
df.dropna(how="all", inplace=True) # to drop the empty line at file-end
# df.tail()
from sklearn.preprocessing import LabelEncoder
X = df[df.columns[:-1]].values
y = df['class label'].values
enc = LabelEncoder()
label_encoder = enc.fit(y)
y = label_encoder.transform(y) + 1
label_dict = {1: 'Setosa', 2: 'Versicolor', 3:'Virginica'}
Histograms and Feature Selection¶
To get a rough idea how the samples of three classes are distributed, let us visualize the distributions of the four different features in 1-dimensional histograms.
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12,6))
for ax,cnt in zip(axes.ravel(), range(4)):
# set bin sizes
min_b = np.floor(np.min(X[:,cnt]))
max_b = np.ceil(np.max(X[:,cnt]))
bins = np.linspace(min_b, max_b, 25)
    # plotting the histograms
for lab,col in zip(range(1,4), ('blue', 'red', 'green')):
ax.hist(X[y==lab, cnt],
color=col,
label='class %s' %label_dict[lab],
bins=bins,
alpha=0.5,)
ylims = ax.get_ylim()
# plot annotation
leg = ax.legend(loc='upper right', fancybox=True, fontsize=8)
leg.get_frame().set_alpha(0.5)
ax.set_ylim([0, max(ylims)+2])
ax.set_xlabel(feature_dict[cnt])
ax.set_title('Iris histogram #%s' %str(cnt+1))
# hide axis ticks
ax.tick_params(axis="both", which="both", bottom=False, top=False,
labelbottom=True, left=False, right=False, labelleft=True)
# remove axis spines
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.spines["bottom"].set_visible(False)
ax.spines["left"].set_visible(False)
axes[0][0].set_ylabel('count')
axes[1][0].set_ylabel('count')
fig.tight_layout()
plt.show()
From just looking at these simple graphical representations of the features, we can already tell that the petal lengths and widths (histograms #3 and #4) are likely better suited as features to separate the three flower classes.
In practice, instead of reducing the dimensionality via a projection (here: LDA), a good alternative would be a feature selection technique. For low-dimensional datasets like Iris, a glance at those histograms would already be very informative. Another simple, but very useful technique would be to use feature selection algorithms.
LDA assumes normal distributed data, features that are statistically independent, and identical covariance matrices for every class. However, LDA can work reasonably well if those assumptions are violated.
In practice, LDA for dimensionality reduction would be just another preprocessing step for a typical machine learning or pattern classification task.
Computing Mean Vectors¶
In this first step, we will start off with a simple computation of the mean vectors $\mathbf{m}_i$, $i=1,2,3$ of the 3 different flower classes:
\begin{align*} \mathbf{m}_i = \begin{bmatrix} \mu_{\omega_i (\text{sepal length})}\\ \mu_{\omega_i (\text{sepal width})}\\ \mu_{\omega_i (\text{petal length})}\\ \mu_{\omega_i (\text{petal width})}\\ \end{bmatrix} \; , \quad \text{with} \quad i = 1,2,3 \end{align*}
np.set_printoptions(precision=4)
mean_vectors = []
for cl in range(1,4):
mean_vectors.append(np.mean(X[y==cl], axis=0))
print('Mean Vector class %s: %s' %(cl, mean_vectors[cl-1]))
Mean Vector class 1: [5.006 3.418 1.464 0.244] Mean Vector class 2: [5.936 2.77 4.26 1.326] Mean Vector class 3: [6.588 2.974 5.552 2.026]
Computing the Scatter Matrices¶
Now, we will compute the two $4 \times 4$ ($p \times p$) matrices: the within-class and the between-class scatter matrix.
Within-class scatter matrix \begin{align*} S_W = \sum_{k=1}^{K} S_k \end{align*}
where \begin{align*} S_k &= \sum_{i \in \mathcal{C}_k} (\mathbf{x}_i - \mathbf{m}_k) (\mathbf{x}_i - \mathbf{m}_k)^\intercal \\ \mathbf{m}_k &= \frac{1}{n_k} \sum_{i \in \mathcal{C}_k} \mathbf{x}_i \end{align*}
and $n_k$ is the sample size of the class $k$.
S_W = np.zeros((4,4))
for cl,mv in zip(range(1,4), mean_vectors):
class_sc_mat = np.zeros((4,4)) # scatter matrix for every class
for row in X[y == cl]:
row, mv = row.reshape(4,1), mv.reshape(4,1) # make column vectors
class_sc_mat += (row-mv).dot((row-mv).T)
S_W += class_sc_mat # sum class scatter matrices
print('within-class Scatter Matrix:\n', S_W)
within-class Scatter Matrix: [[38.9562 13.683 24.614 5.6556] [13.683 17.035 8.12 4.9132] [24.614 8.12 27.22 6.2536] [ 5.6556 4.9132 6.2536 6.1756]]
The total variance matrix \begin{align*} S_T = \sum_{i=1}^{n} (\mathbf{x}_i - \mathbf{m}) (\mathbf{x}_i - \mathbf{m})^\intercal \end{align*}
where $\mathbf{m}$ is the mean of total data set
\begin{align*} \mathbf{m} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i = \frac{1}{n} \sum_{k=1}^{K} n_k \mathbf{m}_k \end{align*}\begin{align*} S_T = S_W + S_B \end{align*}where $S_B$ is between-class scatter matrix
\begin{align*} S_B = \sum_{k=1}^{K} n_{k} (\mathbf{m}_k - \mathbf{m}) (\mathbf{m}_k - \mathbf{m})^\intercal \end{align*}
overall_mean = np.mean(X, axis=0)
S_B = np.zeros((4,4))
for i,mean_vec in enumerate(mean_vectors):
n = X[y==i+1,:].shape[0]
mean_vec = mean_vec.reshape(4,1) # make column vector
overall_mean = overall_mean.reshape(4,1) # make column vector
S_B += n * (mean_vec - overall_mean).dot((mean_vec - overall_mean).T)
print('between-class Scatter Matrix:\n', S_B)
between-class Scatter Matrix: [[ 63.2121 -19.534 165.1647 71.3631] [-19.534 10.9776 -56.0552 -22.4924] [165.1647 -56.0552 436.6437 186.9081] [ 71.3631 -22.4924 186.9081 80.6041]]
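As a quick numerical check of the decomposition $S_T = S_W + S_B$ stated above (a small sketch reusing the matrices just computed):
m_total = np.mean(X, axis=0)                        # mean of the whole data set
S_T = (X - m_total).T.dot(X - m_total)              # total scatter matrix
print(np.allclose(S_T, S_W + S_B))                  # True (up to floating-point error)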
Solving the Generalized Eigenvalue Problem¶
Next, we will solve the generalized eigenvalue problem for the matrix $S^{-1}_W S_B$ to obtain the linear discriminants where
\begin{align*} (S^{-1}_W S_B) v = \lambda v \end{align*}
$\lambda$: eigenvalue
$v$: eigenvector
eig_vals, eig_vecs = np.linalg.eig(np.linalg.inv(S_W).dot(S_B))
for i in range(len(eig_vals)):
eigvec_sc = eig_vecs[:,i].reshape(4,1)
print('\nEigenvector {}: \n{}'.format(i+1, eigvec_sc.real))
print('Eigenvalue {:}: {:.2e}'.format(i+1, eig_vals[i].real))
Eigenvector 1: [[-0.2049] [-0.3871] [ 0.5465] [ 0.7138]] Eigenvalue 1: 3.23e+01 Eigenvector 2: [[-0.009 ] [-0.589 ] [ 0.2543] [-0.767 ]] Eigenvalue 2: 2.78e-01 Eigenvector 3: [[-0.884 ] [ 0.3053] [ 0.282 ] [ 0.214 ]] Eigenvalue 3: 7.64e-16 Eigenvector 4: [[ 0.3921] [-0.4324] [-0.4832] [ 0.6526]] Eigenvalue 4: 7.70e-15
Recall from linear algebra that both eigenvectors and eigenvalues provide us with information about the distortion of a linear transformation: the eigenvectors are basically the directions of this distortion, and the eigenvalues are the scaling factors for the eigenvectors, describing the magnitude of the distortion.
If we are performing the LDA for dimensionality reduction, the eigenvectors are important since they will form the new axes of our new feature subspace; the associated eigenvalues are of particular interest since they will tell us how "informative" the new "axes" are.
Checking the eigenvector-eigenvalue calculation
for i in range(len(eig_vals)):
eigv = eig_vecs[:,i].reshape(4,1)
np.testing.assert_array_almost_equal(np.linalg.inv(S_W).dot(S_B).dot(eigv),
eig_vals[i] * eigv,
decimal=6, err_msg='', verbose=True)
Constructing the New Feature Subspace¶
We are interested in
projecting the data into a subspace that improves the class separability,
reducing the dimensionality of our feature space, where the eigenvectors will form the axes of this new feature subspace.
So, in order to decide which eigenvector(s) we want to drop for our lower-dimensional subspace, we have to take a look at the corresponding eigenvalues of the eigenvectors (the eigenvectors all have unit length 1).
Roughly speaking, the eigenvectors with the lowest eigenvalues bear the least information about the distribution of the data, and those are the ones we want to drop.
The common approach is to rank the eigenvectors from highest to lowest corresponding eigenvalue and choose the top $C$ eigenvectors.
# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]
# Sort the (eigenvalue, eigenvector) tuples from high to low
eig_pairs = sorted(eig_pairs, key=lambda k: k[0], reverse=True)
# Visually confirm that the list is correctly sorted by decreasing eigenvalues
print('Eigenvalues in decreasing order:\n')
for i in eig_pairs:
print(i[0])
Eigenvalues in decreasing order: 32.27195779972982 0.27756686384003953 7.698215548377934e-15 7.6422202507159e-16
In fact, the last two eigenvalues should be exactly zero; they are nonzero only because of floating-point imprecision.
In LDA, the number of linear discriminants is at most $K-1$, where $K$ is the number of class labels, since the between-class scatter matrix $S_B$ is the sum of $K$ matrices of rank one or less, and only $K-1$ of them are independent because of the constraint imposed by the overall mean.
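This can be checked directly for the Iris data (a one-line sketch using the $S_B$ computed above):
print(np.linalg.matrix_rank(S_B))    # expected to be 2 (= K - 1 for K = 3 classes)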
Now, let's express the "explained variance" as percentage:
print('Variance explained:\n')
eigv_sum = sum(eig_vals)
for i,j in enumerate(eig_pairs):
print('eigenvalue {0:}: {1:.2%}'.format(i+1, (j[0]/eigv_sum).real))
Variance explained: eigenvalue 1: 99.15% eigenvalue 2: 0.85% eigenvalue 3: 0.00% eigenvalue 4: 0.00%
The first eigenpair is by far the most informative one, and we won't lose much information if we form a 1D feature space based on this eigenpair.
Now choose the $C$ eigenvectors with the largest eigenvalues.
After sorting the eigenpairs by decreasing eigenvalues, it is time to construct our $p \times C$-dimensional eigenvector matrix $W$ (here $4 \times 2$, based on the 2 most informative eigenpairs), thereby reducing the initial 4-dimensional feature space to a 2-dimensional feature subspace.
W = np.hstack((eig_pairs[0][1].reshape(4,1), eig_pairs[1][1].reshape(4,1)))
print('Matrix W:\n', W.real)
Matrix W: [[-0.2049 -0.009 ] [-0.3871 -0.589 ] [ 0.5465 0.2543] [ 0.7138 -0.767 ]]
Transforming the samples¶
We use $W$ to transform our samples onto the new subspace via the equation
\begin{align*} Y = X_{n \times p} W_{p \times C} \end{align*}
X_lda = X.dot(W)
assert X_lda.shape == (150,2), "The matrix is not 150x2 dimensional."
from matplotlib import pyplot as plt
def plot_step_lda():
ax = plt.subplot(111)
for label,marker,color in zip(
range(1,4),('^', 's', 'o'),('blue', 'red', 'green')):
plt.scatter(x=X_lda[:,0].real[y == label],
y=X_lda[:,1].real[y == label],
marker=marker,
color=color,
alpha=0.5,
label=label_dict[label]
)
plt.xlabel('LD1')
plt.ylabel('LD2')
leg = plt.legend(loc='upper right', fancybox=True)
leg.get_frame().set_alpha(0.5)
plt.title('LDA: Iris projection onto the first 2 linear discriminants')
# hide axis ticks
plt.tick_params(axis="both", which="both", bottom=False, top=False,
labelbottom=True, left=False, right=False, labelleft=True)
# remove axis spines
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.spines["bottom"].set_visible(False)
ax.spines["left"].set_visible(False)
plt.grid()
    plt.tight_layout()
plt.show()
plot_step_lda()
The scatter plot above represents our new feature subspace constructed via LDA. We can see that the first linear discriminant "LD1" separates the classes quite nicely. However, the second discriminant, "LD2", does not add much valuable information, which we had already concluded when we looked at the ranked eigenvalues above.
PCA vs LDA¶
Both Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA) are linear transformation techniques that are commonly used for dimensionality reduction.
PCA can be described as an "unsupervised" algorithm, since it "ignores" class labels and its goal is to find the directions (the so-called principal components) that maximize the variance in a dataset.
LDA is "supervised" and computes the directions ("linear discriminants") that will represent the axes that that maximize the separation between multiple classes.
Although it might sound intuitive that LDA is superior to PCA for a multi-class classification task where the class labels are known, this might not always the case. For example, comparisons between classification accuracies for image recognition after using PCA or LDA show that PCA tends to outperform LDA if the number of samples per class is relatively small (PCA vs. LDA, A.M. Martinez et al., 2001).
In practice, often a LDA follows a PCA for dimensionality reduction (PCA can be performed before LDA to regularize the problem and avoid over-fitting).
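As an illustration of this practice, PCA can be chained before LDA with a scikit-learn Pipeline (a sketch; this pipeline is not used in the rest of this note):
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
pca_lda = Pipeline([('pca', PCA(n_components=3)),                          # keep 3 principal components first
                    ('lda', LinearDiscriminantAnalysis(n_components=2))])  # then 2 linear discriminants
X_pca_lda = pca_lda.fit_transform(X, y)                                    # X, y are the Iris arrays above
print(X_pca_lda.shape)                                                     # (150, 2)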
There is a more convenient way to achieve the same via the LDA classifier (or PCA) implemented in scikit-learn.
from sklearn.decomposition import PCA as sklearnPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
sklearn_lda = LDA(n_components=2)
X_lda_sklearn = sklearn_lda.fit_transform(X, y)
def plot_scikit_lda(X, title):
ax = plt.subplot(111)
for label,marker,color in zip(
range(1,4),('^', 's', 'o'),('blue', 'red', 'green')):
plt.scatter(x=X[:,0][y == label],
y=X[:,1][y == label] * -1, # flip the figure
marker=marker,
color=color,
alpha=0.5,
label=label_dict[label])
plt.xlabel('LD1')
plt.ylabel('LD2')
leg = plt.legend(loc='upper right', fancybox=True)
leg.get_frame().set_alpha(0.5)
plt.title(title)
# hide axis ticks
plt.tick_params(axis="both", which="both", bottom=False, top=False,
labelbottom=True, left=False, right=False, labelleft=True)
# remove axis spines
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.spines["bottom"].set_visible(False)
ax.spines["left"].set_visible(False)
plt.grid()
    plt.tight_layout()
plt.show()
sklearn_pca = sklearnPCA(n_components=2)
X_pca = sklearn_pca.fit_transform(X)
def plot_pca():
ax = plt.subplot(111)
for label,marker,color in zip(
range(1,4),('^', 's', 'o'),('blue', 'red', 'green')):
plt.scatter(x=X_pca[:,0][y == label],
y=X_pca[:,1][y == label],
marker=marker,
color=color,
alpha=0.5,
label=label_dict[label]
)
plt.xlabel('PC1')
plt.ylabel('PC2')
leg = plt.legend(loc='upper right', fancybox=True)
leg.get_frame().set_alpha(0.5)
plt.title('PCA: Iris projection onto the first 2 principal components')
# hide axis ticks
plt.tick_params(axis="both", which="both", bottom=False, top=False,
labelbottom=True, left=False, right=False, labelleft=True)
# remove axis spines
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.spines["bottom"].set_visible(False)
ax.spines["left"].set_visible(False)
    plt.tight_layout()
plt.grid()
plt.show()
plot_scikit_lda(X_lda_sklearn, title='Default LDA via scikit-learn')
plot_pca()
The two plots above nicely confirm what we have discussed before: where PCA accounts for the most variance in the whole dataset, LDA gives us the axes that account for the most variance between the individual classes.
Generative Models¶
In a generative approach, we model
the class-conditional density $p(\mathbf{x}|\mathcal{C}_k)$ with a specified parametric functional form, and
the class prior $p(\mathcal{C}_k)$
then use these to compute posterior probabilities $p(\mathcal{C}_k|\mathbf{x})$ through Bayes' theorem.
This indirect approach, compared to modeling $p(\mathcal{C}_k|\mathbf{x})$ directly, is an example of generative modeling, because we can generate synthetic data by drawing values of $\mathbf{x}$ from $p(\mathbf{x})$ as $p(\mathbf{x}) = \sum_{k} p(\mathbf{x}|\mathcal{C}_k) p(\mathcal{C}_k)$. We demonstrate generative modeling in the following two cases:
$K=2$¶
When $K=2$, we have \begin{align} p(\mathcal{C}_1 | \mathbf{x}) = \frac{p(\mathbf{x}|\mathcal{C}_1)p(\mathcal{C}_1)}{p(\mathbf{x}|\mathcal{C}_1)p(\mathcal{C}_1) + p(\mathbf{x}|\mathcal{C}_2)p(\mathcal{C}_2)} = \frac{1}{1+\exp(-a)} = \sigma(a) \nonumber \end{align}
with \begin{align} a = \ln \frac{p(\mathbf{x}|\mathcal{C}_1)p(\mathcal{C}_1)}{p(\mathbf{x}|\mathcal{C}_2)p(\mathcal{C}_2)} \triangleq \mathbf{w}^\intercal \mathbf{x} \label{eqn_act} \end{align}
where $\sigma(\cdot)$ is the logistic sigmoid function.
$K > 2$¶
For general $K > 2$, we have
\begin{align*} p(\mathcal{C}_k | \mathbf{x}) &= \frac{p(\mathbf{x}|\mathcal{C}_k)p(\mathcal{C}_k)}{\sum_{j}p(\mathbf{x}|\mathcal{C}_j)p(\mathcal{C}_j)} \\ &= \frac{\exp(a_k)}{\sum_{j}\exp(a_j)} = \text{softmax}_k (a) \end{align*}where the activation function
\begin{align*} a_k = \ln p(\mathbf{x}|\mathcal{C}_k)p(\mathcal{C}_k) \end{align*}and the softmax function represents a smoothed version of the "max" function.
Continuous Features¶
Suppose the feature $\mathbf{x}$ is continuous. Let us assume that the class-conditional densities are Gaussian, i.e., \begin{align} p(\mathbf{x}|\mathcal{C}_k) = \frac{1}{(2\pi)^{D/2}} \frac{1}{\lvert \Sigma \rvert ^{1/2}} \exp \left( -\frac{1}{2} (\mathbf{x} - \mathbf{\mu}_k)^\intercal \Sigma^{-1} (\mathbf{x} - \mathbf{\mu}_k) \right) \nonumber \end{align}
and then explore the resulting form for the posterior probabilities for $K=2$. Following the generative modeling approach for $K = 2$, we have
\begin{align} a &= \ln \frac{p(\mathbf{x}|\mathcal{C}_1)p(\mathcal{C}_1)}{p(\mathbf{x}|\mathcal{C}_2)p(\mathcal{C}_2)} \nonumber \\ & = \mathbf{x}^\intercal \underbrace{\Sigma^{-1} (\mu_1 - \mu_2)}_{\mathbf{w}} - \underbrace{\frac{1}{2} (\mu_1+\mu_2)^\intercal \Sigma^{-1} (\mu_1 - \mu_2) + \ln \frac{p(\mathcal{C}_1)}{p(\mathcal{C}_2)}}_{w_0} \nonumber \end{align}which is similar to LDA. Hence, \begin{align*} p(\mathcal{C}_1 | \mathbf{x}) = \sigma\left(\mathbf{w}^\intercal \mathbf{x} + w_0\right) \end{align*}
Once we have specified a parametric functional form for the class-conditional densities $p(\mathbf{x}|\mathcal{C}_k)$, we can determine the values of the parameters using maximum likelihood. Let
\begin{align} p(\mathbf{x}_i, \mathcal{C}_1) &= p(\mathcal{C}_1)p(\mathbf{x}_i|\mathcal{C}_1) = \pi \mathcal{N}(\mathbf{x}_i|\mathbf{\mu}_1, \Sigma) \nonumber \\ p(\mathbf{x}_i, \mathcal{C}_2) &= p(\mathcal{C}_2)p(\mathbf{x}_i|\mathcal{C}_2) = (1-\pi) \mathcal{N}(\mathbf{x}_i|\mathbf{\mu}_2, \Sigma) \nonumber \end{align}The parameters $\pi$, $\mathbf{\mu}_1$, $\mathbf{\mu}_2$ and $\Sigma$ can be obtained by maximizing the likelihood function
\begin{align} p(\mathbf{t}|\pi,\mathbf{\mu}_1,\mathbf{\mu}_2,\Sigma) = \prod_{i=1}^{n} \left( \pi \mathcal{N}(\mathbf{x}_i|\mathbf{\mu}_1, \Sigma)\right)^{t_i} \left( (1-\pi) \mathcal{N}(\mathbf{x}_i|\mathbf{\mu}_2, \Sigma)\right)^{1-t_i} \nonumber \end{align}where $t_i = 1$ denotes class $\mathcal{C}_1$ and $t_i = 0$ denotes class $\mathcal{C}_2$.
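Maximizing this likelihood gives the standard closed-form estimates (not derived here; they agree with the LDA estimates given earlier):
\begin{align*} \pi = \frac{n_1}{n}, \quad \mathbf{\mu}_1 = \frac{1}{n_1}\sum_{i=1}^{n} t_i \mathbf{x}_i, \quad \mathbf{\mu}_2 = \frac{1}{n_2}\sum_{i=1}^{n} (1-t_i) \mathbf{x}_i, \quad \Sigma = \frac{n_1}{n} \mathbf{S}_1 + \frac{n_2}{n} \mathbf{S}_2 \end{align*}where $\mathbf{S}_k = \frac{1}{n_k}\sum_{i \in \mathcal{C}_k} (\mathbf{x}_i - \mathbf{\mu}_k)(\mathbf{x}_i - \mathbf{\mu}_k)^\intercal$ and $n_1$, $n_2$ are the numbers of points in $\mathcal{C}_1$ and $\mathcal{C}_2$.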
Recall that LDA (or QDA) is a generative model and its parameters are estimated by maximizing the likelihood function. However, this approach is not robust to outliers because the maximum likelihood estimation of a Gaussian is not robust to outliers (see Student-t Distribution).
Discrete Features¶
When we have discrete features, e.g., $x_j \in \{0,1\}$, the class-conditional densities $p(\mathbf{x}|\mathcal{C}_k)$ cannot be Gaussian. We make the naive Bayes assumption that the feature values are independent given the class: \begin{align*} p(\mathbf{x}|\mathcal{C}_k) &= \prod_{j=1}^{p} \mu_{kj}^{x_j}(1-\mu_{kj})^{1-x_j} \end{align*}
Following the generative modeling approach for $K>2$, we have \begin{align*} a_k(\mathbf{x}) &= \ln p(\mathbf{x}|\mathcal{C}_k)p(\mathcal{C}_k) = \sum_{j=1}^{p} \left\{ x_j \ln\mu_{kj} + (1-x_j)\ln(1-\mu_{kj}) \right\} + \ln p(\mathcal{C}_k) \end{align*}
$a_k(\mathbf{x})$ is again a linear function of the input values $\mathbf{x}$, and it is used to compute the posterior probabilities $p(\mathcal{C}_k | \mathbf{x})$.
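A minimal sketch of this naive Bayes classifier on synthetic binary features (illustration only; the data and variable names below are hypothetical and not used elsewhere in this note):
rng = np.random.default_rng(0)
Xb = (rng.random((200, 5)) < 0.5).astype(int)                   # 200 samples, 5 binary features
tb = (Xb[:, 0] | Xb[:, 1]).astype(int)                          # a simple label rule, purely for illustration
eps = 1e-9                                                      # avoid log(0)
priors = np.array([np.mean(tb == k) for k in range(2)])         # p(C_k)
mu = np.array([Xb[tb == k].mean(axis=0) for k in range(2)])     # mu_{kj} = fraction of x_j = 1 in class k
# a_k(x) = sum_j [ x_j ln mu_kj + (1 - x_j) ln(1 - mu_kj) ] + ln p(C_k)
a = Xb @ np.log(mu.T + eps) + (1 - Xb) @ np.log(1 - mu.T + eps) + np.log(priors)
print('training accuracy:', np.mean(a.argmax(axis=1) == tb))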
Discriminative Models¶
Recall that the generative approach gives
\begin{align} p(\mathcal{C}_1 | \mathbf{x}) = \frac{p(\mathbf{x}|\mathcal{C}_1)p(\mathcal{C}_1)}{p(\mathbf{x}|\mathcal{C}_1)p(\mathcal{C}_1) + p(\mathbf{x}|\mathcal{C}_2)p(\mathcal{C}_2)} = \frac{1}{1+\exp(-a)} = \sigma(a) = \sigma(\mathbf{w}^\intercal \mathbf{x}) \nonumber \end{align}for $K=2$. The posterior probability is a logistic sigmoid acting on a linear function of $\mathbf{x}$ for a wide choice of $p(\mathbf{x}|\mathcal{C}_k)$; for example, we can choose $p(\mathbf{x}|\mathcal{C}_k)$ to be Gaussian for continuous features. In fact, as long as the class-conditional densities $p(\mathbf{x}|\mathcal{C}_k)$ are members of the exponential family, the activation $a$ is a linear function of $\mathbf{x}$.
The linearity indicates an alternative direct (discriminative) approach that we may model $p(\mathcal{C}_k | \mathbf{x})$ directly by using the functional form of the generalized linear model
\begin{align*} p(\mathcal{C}_k | \mathbf{x}) = f(\mathbf{w}^\intercal_k \mathbf{x}) \end{align*}In logistic regression, we set $f = \sigma(\cdot)$.
Logistic Regression¶
For data set $\{\mathbf{x}_i,t_i\}, i=1,\ldots,n$ with $t_i \in \{0,1\}$, the likelihood function
\begin{align*} p(\mathbf{t}|\mathbf{w}, \phi) = \prod_{i=1}^{n} y_i^{t_i}(1-y_i)^{1-t_i} \tag{Likelihood} \end{align*}where $\mathbf{t} = (t_1,\ldots,t_n)^\intercal$ and $y_i=p(\mathcal{C}_1|\mathbf{x}_i) = \sigma(\mathbf{w}^\intercal \mathbf{x}_i)$. Since we model $p(\mathcal{C}_k | \mathbf{x})$ directly, our likelihood function is different from the one in the generative model and does not depend on the joint distribution.
The error function is
\begin{align*} E(\mathbf{w}) &= -\ln p(\mathbf{t}|\mathbf{w}) = -\sum_{i=1}^{n} \left( t_i \ln y_i + (1-t_i) \ln (1-y_i) \right) \nonumber \\ \nabla E(\mathbf{w}) &= \sum_{i=1}^{n} (y_i - t_i) \phi_{i} \tag{Error} \end{align*}$\mathbf{w}$ is estimated through iterative reweighted least squares:
\begin{align*} \mathbf{w}^{\text{new}} = \mathbf{w}^{\text{old}} - H^{-1} \nabla E(\mathbf{w}^{\text{old}}) \end{align*}where $H$ is the Hessian matrix.
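A minimal sketch of this IRLS (Newton) update for the two-class case. Two details below are assumptions added for numerical stability and are not part of the equations above: the logits are clipped, and a small ridge term keeps the Hessian invertible.
def irls_logistic(Phi, t, n_iter=20):
    # Phi: (n, m) design matrix; t: (n,) targets in {0, 1}
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        y = 1 / (1 + np.exp(-np.clip(Phi @ w, -30, 30)))        # y_i = sigma(w^T phi_i)
        R = y * (1 - y)                                         # diagonal of the weighting matrix R
        grad = Phi.T @ (y - t)                                  # gradient of E(w)
        H = Phi.T @ (Phi * R[:, None]) + 1e-6 * np.eye(len(w))  # Hessian Phi^T R Phi (+ small ridge)
        w = w - np.linalg.solve(H, grad)                        # Newton / IRLS step
    return w
x2c, t2c = train_data['2-class']
w_irls = irls_logistic(np.hstack([np.ones((len(x2c), 1)), x2c]), t2c)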
The discriminative approach requires fewer adaptive parameters compared to the generative approach. Note that:
LDA (or QDA) is based on the generative approach, while logistic regression is based on the discriminative approach;
LDA and logistic regression have exactly the same form for the posterior probability, $p(\mathcal{C}_k | \mathbf{x}) = \sigma(\mathbf{w}^\intercal_k \mathbf{x})$;
LDA and logistic regression differ in how the linear coefficients $\mathbf{w}_k$ are estimated; in particular, logistic regression does not assume Gaussian class-conditional densities.
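The Logistic classifier called in the code below is, like LeastSquares above, not defined in this note. A minimal multiclass stand-in (softmax regression trained by plain gradient descent rather than the IRLS update above), sufficient to run the following cells, could be:
class Logistic:
    """Minimal softmax (multiclass logistic) regression trained by gradient descent."""
    def fit(self, X, t, lr=0.01, n_iter=2000):
        X = np.hstack([np.ones((len(X), 1)), X])        # prepend the bias feature
        T = np.eye(int(np.max(t)) + 1)[t]               # one-hot targets, shape (n, K)
        self.W = np.zeros((X.shape[1], T.shape[1]))
        for _ in range(n_iter):
            A = X @ self.W
            A -= A.max(axis=1, keepdims=True)           # shift logits for numerical stability
            Y = np.exp(A)
            Y /= Y.sum(axis=1, keepdims=True)           # softmax probabilities
            self.W -= lr * X.T @ (Y - T) / len(X)       # gradient of the cross-entropy error
        return self
    def predict(self, X):
        X = np.hstack([np.ones((len(X), 1)), X])
        return np.argmax(X @ self.W, axis=1)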
def comp_log_ls(data):
x_train, t_train = data
x1_test, x2_test = np.meshgrid(np.linspace(-5, np.ceil(np.max(x_train)), 100),
np.linspace(-5, np.ceil(np.max(x_train)), 100))
x_test = np.array([x1_test, x2_test]).reshape(2, -1).T
t_ls = LeastSquares().fit(x_train, t_train).predict(x_test)
t_lr = Logistic().fit(x_train, t_train).predict(x_test)
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
plt.scatter(x_train[:, 0], x_train[:, 1], c=t_train)
plt.contourf(x1_test, x2_test, t_ls.reshape(100, 100), alpha=0.2, levels=np.linspace(0, 1, 3),
cmap=matplotlib.colors.ListedColormap(['yellow','green','purple']))
plt.xlim(-5, np.ceil(np.max(x_train)))
plt.ylim(-5, np.ceil(np.max(x_train)))
plt.gca().set_aspect('equal', adjustable='box')
plt.title("Least Squares")
plt.subplot(1, 2, 2)
plt.scatter(x_train[:, 0], x_train[:, 1], c=t_train)
plt.contourf(x1_test, x2_test, t_lr.reshape(100, 100), alpha=0.2, levels=np.linspace(0, 1, 3),
cmap=matplotlib.colors.ListedColormap(['yellow','green','purple']))
plt.xlim(-5, np.ceil(np.max(x_train)))
plt.ylim(-5, np.ceil(np.max(x_train)))
plt.gca().set_aspect('equal', adjustable='box')
plt.title("Logistic Regression")
plt.show()
comp_log_ls(train_data['2-class_ex'])
comp_log_ls(train_data['3-class'])
The maximum likelihood estimate in logistic regression can exhibit severe over-fitting for data sets that are linearly separable. If the data set is linearly separable, any decision boundary separating the two classes has the property
\begin{align*} \mathbf{w}^\intercal \mathbf{x}_i \left\{\begin{aligned} \ge 0 & \text{ if } t_i = 1 \text{ or } \mathbf{x}_i \in \mathcal{C}_1 \\ < 0 & \text{ if } t_i = 0 \text{ or } \mathbf{x}_i \in \mathcal{C}_2 \end{aligned}\right. \end{align*}In equation (Error), minimizing the negative log-likelihood pushes $y_i \triangleq \sigma(\mathbf{w}^\intercal \mathbf{x}_i)$ toward $t_i$ for all $i$. In the separable case this can be achieved exactly, simply by letting the magnitude of $\mathbf{w}$ go to infinity, so that $\sigma(\mathbf{w}^\intercal \mathbf{x})$ becomes a Heaviside function. Maximum likelihood may therefore return any one of the separating solutions, which need not have good predictive capability.
Application: MNIST¶
There is a set of 60,000 training images, plus 10,000 test images, assembled by the National Institute of Standards and Technology (NIST). Each image is a grayscale 28 $\times$ 28 pixel handwritten digit. We classify the images into their 10 categories (0 through 9).
from keras.datasets import mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
train_images_reshape = train_images.reshape((60000, 28 * 28))
test_images_reshape = test_images.reshape((10000, 28 * 28))
def showimg(data, idx):
span = 5
    if data == 'train':
        images, labels = train_images, train_labels
    elif data == 'test':
        images, labels = test_images, test_labels
    else:
        print("data must be 'train' or 'test'.")
        return
    if idx + span >= images.shape[0]:
        print('Index is out of range.')
        return
plt.figure(figsize=(20,4))
for i in range(span):
plt.subplot(1, 5, i + 1)
digit = images[idx+i]
plt.imshow(digit, cmap=plt.cm.binary)
plt.title('Index:{}, Label:{}'.format(idx+i, labels[idx+i]), fontsize = 15)
plt.show()
showimg('train', 7)
# interact(showimg,
# data = widgets.RadioButtons(options=['train', 'test'],
# value='train', description='Data:', disabled=False),
# idx = widgets.IntText(value=7, description='Index:', disabled=False));
We apply logistic regression on training datasets and predict testing images. We use the SAGA algorithm because it is fast when the number of samples is significantly larger than the number of features and is able to finely optimize non-smooth objective functions.
logisticReg = LogisticRegression(multi_class='multinomial',
penalty='l1', solver='saga', tol=0.1)
logisticReg.fit(train_images_reshape, train_labels)
predictions = logisticReg.predict(test_images_reshape)
score = logisticReg.score(test_images_reshape, test_labels)
print("{}% of test data is predicted correctly!".format(score*100))
92.45% of test data is predicted correctly!
index = 0
misclassifiedIndexes = []
for label, predict in zip(test_labels, predictions):
if label != predict:
misclassifiedIndexes.append(index)
index +=1
print("{} in 10000 testing images are misclassified!".format(len(misclassifiedIndexes)))
plt.figure(figsize=(20,4))
for plotIndex, badIndex in enumerate(misclassifiedIndexes[0:5]):
plt.subplot(1, 5, plotIndex + 1)
plt.imshow(np.reshape(test_images_reshape[badIndex], (28,28)), cmap=plt.cm.gray)
plt.title('Predicted: {}, Actual: {}'.format(predictions[badIndex], test_labels[badIndex]), fontsize = 15)
plt.show()
755 in 10000 testing images are misclassified!
As we have a multi-class problem, the coefficient has shape (number of classes, number of features) = (10, 28 $\times$ 28). Thus, each coefficient can be demonstrated as an image.
Classification vector $k$ can be interpreted as follows: if a test image has a high probability of being classified into class $k$, then, comparing it with the class-$k$ image below, we expect to see more pixels in the blue area (high weight) and fewer pixels in the red area (low weight).
coef = logisticReg.coef_.copy()
scale = np.abs(coef).max()
fig, axes = plt.subplots(nrows=2, ncols=5, figsize=(15,6))
for i, ax in enumerate(axes.flat):
im = ax.imshow(coef[i].reshape(28, 28), interpolation='nearest',
cmap=plt.cm.RdBu, vmin=-scale, vmax=scale)
ax.set_xticks(())
ax.set_yticks(())
ax.set_xlabel('Class %i' % i)
plt.suptitle('Classification Vectors')
fig.subplots_adjust(right=0.8)
cbar_ax = fig.add_axes([0.85, 0.15, 0.05, 0.7])
fig.colorbar(im, cax=cbar_ax)
plt.show()
Probit Regression¶
We have seen that the posterior class probabilities are given by $f(a)$ where
$f$ is a logistic (or softmax) transformation acting on $a$,
$a$ is a linear function of the feature variables.
Is there another way to define $f(\cdot)$ such that $f(a)$ equals the posterior class probability $p(t=1|a)$? To answer this question, we model the posterior probability $p(t=1|a)$ directly instead of using Bayes' theorem.
Let target value
\begin{align*} t_i = \left\{ \begin{aligned} 1 & \text{ if } a_i=\mathbf{w}^\intercal \mathbf{\phi}_i \ge \theta \\ 0 & \text{ otherwise.} \end{aligned}\right. \end{align*}If $\theta$ has a probability density $p(\theta)$ (assumed to be a standard Gaussian), then
\begin{align*} p(t=1|a) = \int_{-\infty}^{a} p(\theta) d\theta = \int_{-\infty}^{a} \mathcal{N}(\theta|0,1) d\theta = \Phi(a) = \frac{1}{2} \left(1+\text{erf}\left(\frac{a}{\sqrt{2}}\right) \right) \end{align*}where $\text{erf}$ is the standard error function. The generalized linear model based on the activation function
\begin{align*} f(a) = \Phi(a) = \frac{1}{2} \left(1+\text{erf}\left(\frac{a}{\sqrt{2}}\right) \right) \end{align*}is known as probit regression.
The tails of the logistic sigmoid decay asymptotically like $\exp(-x)$ for $x \to \infty$
The tails of the probit decay asymptotically like $\exp(-x^2)$ for $x \to \infty$
The probit model is significantly more sensitive to outliers.
Mislabelling¶
Both the logistic and the probit models assume the data is correctly labelled. The effect of mislabelling is easily incorporated into a probabilistic model by introducing a probability $\epsilon$ that the target value $t$ has been flipped to the wrong value, so
\begin{align*} p(t|\mathbf{x}) = (1-\epsilon) \sigma(\mathbf{x}) + \epsilon (1-\sigma(\mathbf{x})) \end{align*}where $\epsilon$ may be set in advance, or it may be treated as a hyperparameter whose value is inferred from the data.
Bayesian Logistic Regression¶
Exact Bayesian inference for logistic regression is intractable because the posterior distribution comprises a product of logistic sigmoid functions, one for each data point, multiplied by the Gaussian prior, and the normalization cannot be evaluated in closed form.
Posterior Distribution¶
The posterior distribution over $\mathbf{w}$ is
\begin{align*} p(\mathbf{w}|\mathbf{t}) \propto p(\mathbf{t}|\mathbf{w})p(\mathbf{w}) \end{align*}with a Gaussian prior $p(\mathbf{w}) = \mathcal{N}(\mathbf{w}|\mathbf{m}_0, \mathbf{S}_0)$. From equation (Likelihood),
\begin{align} \ln p(\mathbf{w}|\mathbf{t}) & = -\frac{1}{2} (\mathbf{w}-\mathbf{m}_0)^\intercal \mathbf{S}^{-1}_0 (\mathbf{w}-\mathbf{m}_0) \nonumber \\ & \quad + \sum_{i=1}^{n} \left( t_i \ln y_i + (1-t_i) \ln (1-y_i) \right) + \text{const} \nonumber \end{align}The posterior distribution $p(\mathbf{w}|\mathbf{t})$ is not Gaussian due to the second term.
We seek a Gaussian representation for the posterior distribution by using the Laplace approximation.
The Laplace approximation finds a Gaussian approximation to a probability density $p(z)$:
\begin{align*} p(z) &= \frac{1}{Z} f(z) \text{ where } Z = \int f(z) dz \\ p(z) & \approx \mathcal{N}(z|z_0, A^{-1}) \text{ where $z_0$ is a mode of } p(z) \end{align*}so that $f^\prime (z_0) = 0$ and $A = - \nabla \nabla \ln f(z_0)$. The Laplace framework is based purely on the behavior of the true distribution at a mode, and so can fail to capture important global properties. In Bayesian logistic regression,
\begin{align} \mathbf{S}^{-1}_N &= -\nabla \nabla \ln p(\mathbf{w}|\mathbf{t}) = \mathbf{S}^{-1}_0 + \sum_{i=1}^{n} y_i(1-y_i) \phi_i \phi_{i}^\intercal \nonumber \end{align}and the resulting Gaussian approximation to the posterior is $q(\mathbf{w}) = \mathcal{N}(\mathbf{w}|\mathbf{w}_{\text{MAP}}, \mathbf{S}_N)$.
Predictive Distribution¶
The predictive distribution
\begin{align} p(\mathcal{C}_1|\phi,\mathbf{t}) &= \int p(\mathcal{C}_1|\phi,\mathbf{w}) p(\mathbf{w}|\mathbf{t})d\mathbf{w} \approxeq \int \sigma(\mathbf{w}^\intercal \phi) q(\mathbf{w}) d\mathbf{w} \nonumber \\ &= \int \sigma(a) \left(\int \delta(a-\mathbf{w}^\intercal \phi) q(\mathbf{w}) d\mathbf{w}\right) da \equiv \int \sigma(a) p(a) da\nonumber \\ & \approxeq \int \sigma(a) \mathcal{N}(a|\mu_a, \sigma^2_a) da \quad \quad \text{(apply the Laplace approximation)}\nonumber \end{align}where $\delta$ is the Dirac delta function, $a = \mathbf{w}^\intercal \phi$, and
\begin{align} \mu_a &= E(a) = \int p(a) a da = \int q(\mathbf{w}) \mathbf{w}^\intercal \phi d\mathbf{w} = \mathbf{w}^\intercal_{\text{MAP}} \phi \nonumber \\ \sigma^2_a &= \text{var}(a) = \int p(a) (a^2 - \mu_a^2) da = \phi^\intercal \mathbf{S}_N \phi \nonumber \end{align}The predictive distribution is therefore the convolution of a Gaussian with a logistic sigmoid, and cannot be evaluated analytically.
However, by approximating the logistic sigmoid with the probit function $\Phi$, the convolution can be expressed analytically.
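For reference, the standard result (not derived in this note): approximating $\sigma(a) \approx \Phi(\lambda a)$ with $\lambda^2 = \pi/8$ yields
\begin{align*} \int \sigma(a)\, \mathcal{N}(a|\mu_a, \sigma^2_a)\, da \approx \sigma\left(\kappa(\sigma^2_a)\, \mu_a\right), \qquad \kappa(\sigma^2) = \left(1 + \pi \sigma^2 / 8\right)^{-1/2}, \end{align*}so that $p(\mathcal{C}_1|\phi,\mathbf{t}) \approx \sigma\left(\kappa(\sigma^2_a)\, \mu_a\right)$.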
x1_test, x2_test = np.meshgrid(np.linspace(-5, 12, 100), np.linspace(-5, 12, 100))
x_test = np.array([x1_test, x2_test]).reshape(2, -1).T
funcs = {'Linear Discriminant Analysis':'LinearDiscriminantAnalysis',\
'Quadratic Discriminant Analysis':'QuadraticDiscriminantAnalysis',\
'User Logistic':'Logistic', 'Logistic Regression':'LogisticRegression',\
'Bayesian Ridge':'BayesianRidge'}
from ipywidgets import interact, widgets
widget_layout = widgets.Layout(width='270px', height='30px')
@interact(method = widgets.Dropdown(description='Methods:',\
style = {'description_width': 'initial'},\
options=['User Logistic','Logistic Regression','Linear Discriminant Analysis',\
'Quadratic Discriminant Analysis','Bayesian Ridge'],\
value='User Logistic',\
layout=widget_layout))
def classify(method):
plt.figure(figsize=(16, 4))
for i, (key, value) in enumerate(train_data.items()):
plt.subplot(1, 3, i+1)
x_train, t_train = value
t = globals()[funcs[method]]().fit(x_train, t_train).predict(x_test)
plt.scatter(x_train[:,0], x_train[:,1], c=t_train, cmap=matplotlib.colors.ListedColormap(['red','green','blue']))
plt.contourf(x1_test, x2_test, t.reshape(100, 100), alpha=0.3, levels=np.array([0., 0.5, 1.5, 2.]),
cmap=matplotlib.colors.ListedColormap(['yellow','green','purple']))
plt.xlim(-5, 12)
plt.ylim(-5, 12)
plt.colorbar()
plt.gca().set_aspect('equal', adjustable='box')
plt.title("{}".format(key))
plt.show()
Information Criterion¶
- Akaike information criterion (AIC):
$\ln p(\mathcal{D}|\mathbf{w}_{\text{ML}}) - M$, where $\ln p(\mathcal{D}|\mathbf{w}_{\text{ML}})$ is the best-fit log likelihood and $M$ is the number of adaptive parameters in the model.
- Bayesian Information Criterion (BIC):
$\ln p(\mathcal{D}|\mathbf{w}_{\text{MAP}}) - \frac{1}{2} M \ln n$, where $n$ is the number of data points.
The intuition of Bayesian Information Criterion
The Laplace approximation gives
\begin{align} Z &= \int f(\mathbf{z}) d\mathbf{z} \approx f(\mathbf{z}_0) \int \exp \left\{ -\frac{1}{2} (\mathbf{z}-\mathbf{z}_0)^\intercal \mathbf{A} (\mathbf{z} - \mathbf{z}_0)\right\} d\mathbf{z} = f(\mathbf{z}_0) \frac{(2\pi)^{M/2}}{|\mathbf{A}|^{1/2}} \label{eqn_bic} \end{align}where
\begin{align*} \mathbf{A} = - \nabla \nabla \ln f(\mathbf{z})|_{\mathbf{z} = \mathbf{z}_0} \end{align*}From Bayes' theorem, the model evidence is
\begin{align*} p(\mathcal{D}) & = \int p(\mathcal{D}|\theta) p(\theta) d\theta \end{align*}Identifying $Z = p(\mathcal{D})$ and $f(\theta) = p(\mathcal{D}|\theta) p(\theta)$, \eqref{eqn_bic} gives
\begin{align*} \ln p(\mathcal{D}) &\approx \ln p(\mathcal{D}|\mathbf{w}_{\text{MAP}}) + \underbrace{\ln p(\mathbf{w}_{\text{MAP}}) + \frac{M}{2} \ln (2\pi) - \frac{1}{2} \ln |A|}_{\text{Occam factor}} \nonumber \\ &\approx \ln p(\mathcal{D}|\mathbf{w}_{\text{MAP}}) - \frac{1}{2} M \ln n \end{align*}The "Occam factor" penalizes model complexity.
$\blacksquare$